💓Heart Attack Data Analysis🔎

🩺(Prediction at the end)🔮


What is a heart attack?

A heart attack, also called a myocardial infarction, happens when a part of the heart muscle doesn’t get enough blood.

The more time that passes without treatment to restore blood flow, the greater the damage to the heart muscle.

Coronary artery disease (CAD) is the main cause of heart attack. A less common cause is a severe spasm, or sudden contraction, of a coronary artery that can stop blood flow to the heart muscle.

What are the symptoms of heart attack?

The major symptoms of a heart attack are:

  • Chest pain or discomfort. Most heart attacks involve discomfort in the center or left side of the chest that lasts for more than a few minutes or that goes away and comes back. The discomfort can feel like uncomfortable pressure, squeezing, fullness, or pain.

  • Feeling weak, light-headed, or faint. You may also break out into a cold sweat.

  • Pain or discomfort in the jaw, neck, or back.

  • Pain or discomfort in one or both arms or shoulders.

  • Shortness of breath. This often comes along with chest discomfort, but shortness of breath also can happen before chest discomfort.

Exploratory Data Analysis¶

Aim :¶

  • Understand the data ("A small step forward is better than a big one backwards")
  • Begin to develop a modelling strategy

Features¶

  • Age : Age of the patient

  • Sex : Sex of the patient

  • exang: exercise induced angina (1 = yes; 0 = no)

  • ca: number of major vessels (0-3)

  • cp : chest pain type :

    • Value 1: typical angina
    • Value 2: atypical angina
    • Value 3: non-anginal pain
    • Value 4: asymptomatic
  • trtbps : resting blood pressure (in mm Hg)

  • chol : cholesterol in mg/dl fetched via BMI sensor

  • fbs : (fasting blood sugar > 120 mg/dl) (1 = true; 0 = false)

  • rest_ecg : resting electrocardiographic results :

    • Value 0: normal
    • Value 1: having ST-T wave abnormality (T wave inversions and/or ST elevation or depression of > 0.05 mV)
    • Value 2: showing probable or definite left ventricular hypertrophy by Estes' criteria
  • thalach : maximum heart rate achieved

  • target : 0 = less chance of heart attack ; 1 = more chance of heart attack

Base Checklist¶

Shape Analysis :¶

  • target feature : output
  • rows and columns : 303 , 14
  • feature types : qualitative : 0 , quantitative : 14
  • NaN analysis :
    • NaN (0 % of NaN)

Columns Analysis :¶

  • Target Analysis :
    • Balanced (Yes/No) : Yes
    • Percentages : 55% / 45%
  • Categorical values
    • There are 8 categorical (0/1) features (not including the target)
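The categorical-feature tally above can be checked programmatically by counting binary columns. A minimal sketch (with a small synthetic stand-in frame, since the real CSV is only loaded later in the notebook):

```python
import pandas as pd

# Synthetic stand-in with the same column style as heart.csv;
# in the notebook, run this on the real df instead.
df = pd.DataFrame({
    "sex":    [1, 0, 1, 0],
    "fbs":    [0, 0, 1, 0],
    "chol":   [233, 250, 204, 236],
    "output": [1, 1, 0, 0],
})

# A feature is "categorical (0/1)" here if it takes exactly two values.
n_unique = df.drop(columns="output").nunique()
binary_features = n_unique[n_unique == 2].index.tolist()
print(binary_features)  # ['sex', 'fbs']
```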
In [2]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.decomposition import PCA

from dataprep.eda import create_report
from dataprep.eda import plot_missing
from dataprep.eda import plot_correlation
from dataprep.eda import plot

from sklearn.model_selection import train_test_split

from sklearn.metrics import f1_score, accuracy_score, confusion_matrix, classification_report, roc_curve
from sklearn.model_selection import learning_curve, cross_val_score, GridSearchCV
from sklearn.model_selection import train_test_split

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import RobustScaler,StandardScaler,MinMaxScaler

import warnings
warnings.filterwarnings('ignore')

Dataset Analysis¶

In [3]:
data = pd.read_csv('../input/heart-attack-analysis-prediction-dataset/heart.csv')
df = data.copy()
pd.set_option('display.max_rows',df.shape[0])
pd.set_option('display.max_columns',df.shape[1]) 
df.head()
Out[3]:
age sex cp trtbps chol fbs restecg thalachh exng oldpeak slp caa thall output
0 63 1 3 145 233 1 0 150 0 2.3 0 0 1 1
1 37 1 2 130 250 0 1 187 0 3.5 0 0 2 1
2 41 0 1 130 204 0 0 172 0 1.4 2 0 2 1
3 56 1 1 120 236 0 1 178 0 0.8 2 0 2 1
4 57 0 0 120 354 0 1 163 1 0.6 2 0 2 1
In [4]:
(df.isna().sum()/df.shape[0]*100).sort_values(ascending=False)
Out[4]:
age         0.0
sex         0.0
cp          0.0
trtbps      0.0
chol        0.0
fbs         0.0
restecg     0.0
thalachh    0.0
exng        0.0
oldpeak     0.0
slp         0.0
caa         0.0
thall       0.0
output      0.0
dtype: float64
In [5]:
plot_missing(df)
Out[5]:
DataPrep.EDA Report

Missing Statistics

  • Missing Cells : 0
  • Missing Cells (%) : 0.0%
  • Missing Columns : 0
  • Missing Rows : 0
  • Avg Missing Cells per Column : 0.0
  • Avg Missing Cells per Row : 0.0
In [6]:
print('There are' , df.shape[0] , 'rows')
print('There are' , df.shape[1] , 'columns')
There are 303 rows
There are 14 columns
In [7]:
df.duplicated().sum()
Out[7]:
1
In [8]:
df.loc[df.duplicated(keep=False),:]
Out[8]:
age sex cp trtbps chol fbs restecg thalachh exng oldpeak slp caa thall output
163 38 1 2 138 175 0 1 173 0 0.0 2 4 2 1
164 38 1 2 138 175 0 1 173 0 0.0 2 4 2 1
In [9]:
df.drop_duplicates(keep='first',inplace=True)
df.shape
Out[9]:
(302, 14)

Visualising Target and Features¶

In [10]:
df['output'].value_counts(normalize=True) # classes are roughly balanced (54% / 46%)
Out[10]:
1    0.543046
0    0.456954
Name: output, dtype: float64
In [ ]:
for col in df.select_dtypes(include=['float64','int64']):
    sns.displot(df[col], kind='kde', height=3) # displot is figure-level, so no extra plt.figure() is needed
    plt.show()
In [12]:
X = df.drop('output',axis=1)
y = df['output']

Detailed Analysis¶

In [13]:
riskyDF = df[y == 1]
safeDF = df[y == 0]
In [14]:
sns.pairplot(df, height=1.5) # pairplot creates its own figure; a prior plt.figure() only leaves an empty figure behind
plt.show()
In [15]:
corr = df.corr(method='pearson').abs()

fig = plt.figure(figsize=(8,6))
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=0, vmax=1) # abs() correlations lie in [0, 1]; 'tab10' is qualitative and misleading here
plt.title('Pearson Correlation (absolute values)')
plt.show()

print(df.corr()['output'].abs().sort_values())
fbs         0.026826
chol        0.081437
restecg     0.134874
trtbps      0.146269
age         0.221476
sex         0.283609
thall       0.343101
slp         0.343940
caa         0.408992
thalachh    0.419955
oldpeak     0.429146
cp          0.432080
exng        0.435601
output      1.000000
Name: output, dtype: float64
In [16]:
for col in df.select_dtypes(include=['float64','int64']):
    plt.figure(figsize=(4,4))
    sns.kdeplot(riskyDF[col], label='High Risk') # distplot is deprecated; kdeplot (or histplot) replaces it
    sns.kdeplot(safeDF[col], label='Low Risk')
    plt.legend()
    plt.show()

Comments¶

It looks like we have some very useful features here, with an absolute correlation > 0.4. The following features seem promising for predicting whether a patient will have a heart attack or not :

  • oldpeak
  • exng
  • cp
  • thalachh

We can also notice that slp and oldpeak look correlated, let's find out !
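Before plotting, a quick numeric check of that suspected correlation costs one line. A sketch with illustrative stand-in values (in the notebook, use df['slp'] and df['oldpeak'] directly):

```python
import pandas as pd

# Stand-in values chosen to mimic an inverse slp/oldpeak relationship;
# replace 'demo' with df in the notebook.
demo = pd.DataFrame({
    "slp":     [0, 0, 1, 1, 2, 2, 2],
    "oldpeak": [3.1, 2.6, 1.4, 1.0, 0.2, 0.0, 0.4],
})

r = demo["slp"].corr(demo["oldpeak"])  # Pearson correlation by default
print(round(r, 2))
```

A strongly negative r here would confirm the visual impression that higher slp goes with lower oldpeak.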

In [17]:
for col in X.select_dtypes(include=['float64','int64']):
    sns.lmplot(x='oldpeak', y=col, hue='output', data=df, height=4) # lmplot is figure-level; plt.figure() only leaves empty figures behind

A bit of data engineering ...¶

In [19]:
def encoding(df):
    code = {
            # All columns are made of quantitative values (floats actually), so there is no need to encode the features
           }
    for col in df.select_dtypes('object'):
        df.loc[:,col]=df[col].map(code)
        
    return df

def imputation(df):
    
    df = df.dropna(axis=0) # there are no NaN in this dataset; kept for safety
    
    return df

def feature_engineering(df):
    useless_columns = [] # Let's consider we want to use all the features
    df = df.drop(useless_columns,axis=1)
    return df
In [20]:
def preprocessing(df):
    df = encoding(df)
    df = feature_engineering(df)
    df = imputation(df)
    
    X = df.drop('output',axis=1)
    y = df['output']    
      
    return df,X,y

Comments¶

We can now treat the categorical features as quantitative features (note : there are no qualitative features to encode here)

Modelling¶

In [21]:
df = data.copy()
trainset, testset = train_test_split(df, test_size=0.2, random_state=0)
print(trainset['output'].value_counts())
print(testset['output'].value_counts())
1    131
0    111
Name: output, dtype: int64
1    34
0    27
Name: output, dtype: int64
In [22]:
_, X_train, y_train = preprocessing(trainset)
_, X_test, y_test = preprocessing(testset)
In [23]:
preprocessor = make_pipeline(MinMaxScaler())

PCAPipeline = make_pipeline(StandardScaler(), PCA(n_components=2,random_state=0))

RandomPipeline = make_pipeline(preprocessor,RandomForestClassifier(random_state=0))
AdaPipeline = make_pipeline(preprocessor,AdaBoostClassifier(random_state=0))
SVMPipeline = make_pipeline(preprocessor,SVC(random_state=0,probability=True))
KNNPipeline = make_pipeline(preprocessor,KNeighborsClassifier())
LRPipeline = make_pipeline(preprocessor,LogisticRegression(solver='sag'))

PCA Analysis¶

In [24]:
PCA_df = pd.DataFrame(PCAPipeline.fit_transform(X))
PCA_df = pd.concat([PCA_df, y.reset_index(drop=True)], axis=1) # reset the index so rows align after drop_duplicates
PCA_df.head()
Out[24]:
0 1 output
0 0.603024 2.291914 1.0
1 -0.478588 -0.988416 1.0
2 -1.847655 0.020559 1.0
3 -1.724377 -0.490040 1.0
4 -0.403288 0.278693 1.0
In [25]:
plt.figure(figsize=(8,8))
sns.scatterplot(x=PCA_df[0], y=PCA_df[1], hue=PCA_df['output'], palette=sns.color_palette("tab10", 2))
plt.show()

Classification problem¶

In [26]:
dict_of_models = {'RandomForest': RandomPipeline,
'AdaBoost': AdaPipeline,
'SVM': SVMPipeline,
'KNN': KNNPipeline,
'LR': LRPipeline}
In [27]:
def evaluation(model):
    model.fit(X_train, y_train)
    # calculating the probabilities
    y_pred_proba = model.predict_proba(X_test)

    # finding the predicted valued
    y_pred = np.argmax(y_pred_proba,axis=1)
    print('Accuracy = ', accuracy_score(y_test, y_pred))
    print('-')
    print(confusion_matrix(y_test,y_pred))
    print('-')
    print(classification_report(y_test,y_pred))
    print('-')
    
    N, train_score, val_score = learning_curve(model, X_train, y_train, 
                                               cv=4, scoring='f1', 
                                               train_sizes=np.linspace(0.1,1,10))
    plt.figure(figsize=(12,8))
    plt.plot(N, train_score.mean(axis=1), label='train score')
    plt.plot(N, val_score.mean(axis=1), label='validation score')
    plt.legend()
In [28]:
for name, model in dict_of_models.items():
    print('---------------------------------')
    print(name)
    evaluation(model)
---------------------------------
RandomForest
Accuracy =  0.8852459016393442
-
[[24  3]
 [ 4 30]]
-
              precision    recall  f1-score   support

           0       0.86      0.89      0.87        27
           1       0.91      0.88      0.90        34

    accuracy                           0.89        61
   macro avg       0.88      0.89      0.88        61
weighted avg       0.89      0.89      0.89        61

-
---------------------------------
AdaBoost
Accuracy =  0.9016393442622951
-
[[25  2]
 [ 4 30]]
-
              precision    recall  f1-score   support

           0       0.86      0.93      0.89        27
           1       0.94      0.88      0.91        34

    accuracy                           0.90        61
   macro avg       0.90      0.90      0.90        61
weighted avg       0.90      0.90      0.90        61

-
---------------------------------
SVM
Accuracy =  0.8524590163934426
-
[[21  6]
 [ 3 31]]
-
              precision    recall  f1-score   support

           0       0.88      0.78      0.82        27
           1       0.84      0.91      0.87        34

    accuracy                           0.85        61
   macro avg       0.86      0.84      0.85        61
weighted avg       0.85      0.85      0.85        61

-
---------------------------------
KNN
Accuracy =  0.8852459016393442
-
[[22  5]
 [ 2 32]]
-
              precision    recall  f1-score   support

           0       0.92      0.81      0.86        27
           1       0.86      0.94      0.90        34

    accuracy                           0.89        61
   macro avg       0.89      0.88      0.88        61
weighted avg       0.89      0.89      0.88        61

-
---------------------------------
LR
Accuracy =  0.8360655737704918
-
[[20  7]
 [ 3 31]]
-
              precision    recall  f1-score   support

           0       0.87      0.74      0.80        27
           1       0.82      0.91      0.86        34

    accuracy                           0.84        61
   macro avg       0.84      0.83      0.83        61
weighted avg       0.84      0.84      0.83        61

-

Comments¶

All 5 models look promising, but AdaBoost has a slightly better accuracy (90%)¶
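GridSearchCV is imported above but never used; tuning AdaBoost's main hyperparameters could squeeze out a little more performance. A hedged sketch (the grid values are illustrative, and synthetic data stands in for X_train/y_train):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; in the notebook, fit on X_train / y_train instead.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

pipe = make_pipeline(MinMaxScaler(), AdaBoostClassifier(random_state=0))

# make_pipeline names the step after the lowercased class name.
param_grid = {
    "adaboostclassifier__n_estimators": [50, 100, 200],
    "adaboostclassifier__learning_rate": [0.1, 0.5, 1.0],
}

grid = GridSearchCV(pipe, param_grid, cv=4, scoring="f1")
grid.fit(X_demo, y_demo)
print(grid.best_params_, round(grid.best_score_, 3))
```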

Using AdaBoost¶

In [29]:
AdaPipeline.fit(X_train, y_train)
y_proba = AdaPipeline.predict_proba(X_test)
y_pred = np.argmax(y_proba,axis=1)

print("Adaboost : ", accuracy_score(y_test, y_pred))
Adaboost :  0.9016393442622951
In [30]:
y_pred_prob = AdaPipeline.predict_proba(X_test)[:,1]

fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

plt.plot(fpr,tpr,label='AdaBoost ROC Curve')
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("AdaBoost ROC Curve")
plt.show()
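The ROC curve can also be summarised by its area under the curve (AUC) with sklearn's roc_auc_score. A tiny illustration with hand-picked scores (in the notebook, pass y_test and y_pred_prob instead):

```python
from sklearn.metrics import roc_auc_score

# Illustrative labels and predicted probabilities, not model output.
y_true  = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)
print(auc)  # 0.75
```

An AUC of 1.0 is a perfect ranking, 0.5 is chance level.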

Using KNN¶

In [31]:
KNNPipeline.fit(X_train, y_train)
y_proba = KNNPipeline.predict_proba(X_test)
y_pred = np.argmax(y_proba,axis=1)

print("KNN : ", accuracy_score(y_test, y_pred))
KNN :  0.8852459016393442

KNN Optimization¶

In [32]:
err = []

for i in range(1, 40):
    model = make_pipeline(preprocessor, KNeighborsClassifier(n_neighbors=i))
    model.fit(X_train, y_train)
    pred_i = model.predict(X_test)
    err.append(np.mean(pred_i != y_test))

plt.figure(figsize=(10, 8))
plt.plot(range(1, 40), err, color='blue',
         linestyle='dashed', marker='o',
         markerfacecolor='blue', markersize=8)

plt.title('Mean Err = f(K)')
plt.xlabel('K')
plt.ylabel('Mean Err')
Out[32]:
Text(0, 0.5, 'Mean Err')
In [33]:
KNNPipeline = make_pipeline(preprocessor, KNeighborsClassifier(n_neighbors=7))
KNNPipeline.fit(X_train, y_train)
y_proba = KNNPipeline.predict_proba(X_test)
y_pred = np.argmax(y_proba,axis=1)

print("KNN : ", accuracy_score(y_test, y_pred))
KNN :  0.9016393442622951
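Note that choosing k from test-set error risks overfitting the test set; cross-validating on the training data is the safer route. A sketch with synthetic stand-in data (in the notebook, use X_train/y_train and the MinMaxScaler preprocessor defined above):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data; replace with X_train / y_train in the notebook.
X_demo, y_demo = make_classification(n_samples=300, n_features=13, random_state=0)

# Mean cross-validated f1 for each candidate k (odd values avoid ties).
scores = {}
for k in range(1, 20, 2):
    model = make_pipeline(MinMaxScaler(), KNeighborsClassifier(n_neighbors=k))
    scores[k] = cross_val_score(model, X_demo, y_demo, cv=4, scoring="f1").mean()

best_k = max(scores, key=scores.get)
print(best_k, round(scores[best_k], 3))
```

The test set then serves only for the final, unbiased estimate of the chosen model.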
In [ ]: